[WIP] feat(lora): LoRA adapter serving#83

Draft
qywu wants to merge 69 commits into lightseekorg:main from qywu:feat/lora-adapter-serving

qywu (Collaborator) commented May 11, 2026

Summary (WIP)

End-to-end LoRA adapter serving for tokenspeed. Branch is not yet rebased on current main — many test files appear as deletions because the last merge from main predates several recent PRs (#18, #51, etc.). Will refresh before un-drafting.

What's in this PR

  • Scaffolding: feat(lora): scaffold LoRA adapter serving infrastructure.
  • Prefix-cache namespacing (C++): per-adapter namespacing in the scheduler so two adapters with the same prompt don't collide on cached KV.
  • HiCache wiring: thread lora_id through hybrid cache paths.
  • LoraManager: GPU weight pool with LRU eviction, TP-aware adapter application.
  • HTTP plumbing: lora_path accepted on /v1/completions and /v1/chat/completions; propagated through GenerateReqInput.__getitem__.
  • MLP target support: gate_proj / up_proj / down_proj LoRA targets in addition to attention QKV/output.
  • CUDA-graph support: segment-grouped Triton kernels; separate no-LoRA graph variant captured so base-only batches skip the LoRA path.
  • Tiered pool: GPU ↔ CPU ↔ disk pool with async prefetch.
  • Pack scheduling: pack policy + cold/warm latency benchmark.
  • Eager-mode fixes: --enable-lora works without CUDA graphs.
  • Misc perf: drop pure-PyTorch RMSNorm fallback in qk_norm; evict adapter namespace on unload.
  • Docs: HTML references for the LoRA implementation and the broader tokenspeed codebase structure.

Status

This is an early draft — opening for visibility and review of the overall shape. Next steps before un-drafting:

Test plan

  • C++ unit test: test_lora_prefix_cache.cpp.
  • Python E2E: load base + 2 adapters, verify per-adapter outputs, prefix-cache namespace isolation.
  • TP=2 sanity check once this branch picks up the dense-MLP TP fix from PR #80 (fix(qwen3): plumb tensor-parallel info through MLP layers), which is already in main; this branch needs a rebase.

qywu and others added 30 commits May 7, 2026 08:14
Adds the foundational types and API surface for PEFT-style LoRA adapter
serving, unblocking the full runtime implementation.

New files:
  python/tokenspeed/runtime/lora/lora_config.py  — LoraConfig dataclass;
    loads from PEFT adapter_config.json; exposes r, lora_alpha, scaling.
  python/tokenspeed/runtime/lora/lora_registry.py — LoraRegistry tracks
    loaded adapters, maps names to stable integer IDs, enforces max_loras
    capacity (pinned adapters bypass the limit).
  python/tokenspeed/runtime/lora/__init__.py
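
The registry and config behaviour listed above can be sketched as follows. This is a minimal illustration, not the code in this commit: `ToyLoraConfig` and `ToyLoraRegistry` are hypothetical names, and starting adapter IDs at 1 (leaving 0 for the base model) is an assumption based on how lora_id is used elsewhere in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoraConfig:
    # Hypothetical stand-in for LoraConfig; only fields named in the commit.
    r: int
    lora_alpha: float

    @property
    def scaling(self) -> float:
        # PEFT-style scaling factor applied to the LoRA delta.
        return self.lora_alpha / self.r


@dataclass
class ToyLoraRegistry:
    max_loras: int
    _ids: dict = field(default_factory=dict)   # name -> integer id
    _pinned: set = field(default_factory=set)
    _next_id: int = 1                          # assumption: 0 means base model

    def register(self, name: str, pinned: bool = False) -> int:
        if name in self._ids:
            return self._ids[name]             # IDs stay stable on re-register
        unpinned = len(self._ids) - len(self._pinned)
        if not pinned and unpinned >= self.max_loras:
            raise RuntimeError(f"max_loras={self.max_loras} reached")
        self._ids[name] = self._next_id
        self._next_id += 1
        if pinned:
            self._pinned.add(name)
        return self._ids[name]

    def unregister(self, name: str) -> None:
        self._ids.pop(name, None)
        self._pinned.discard(name)
```

Pinned adapters skip the capacity check entirely in this sketch, which is one plausible reading of "pinned adapters bypass the limit".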

API additions:
  GenerateReqInput.lora_path — per-request adapter selector (name or path).
  ServerArgs: --enable-lora, --max-loras, --max-lora-rank.
  EngineBase.load_lora_adapter() / unload_lora_adapter() — abstract API
    with NotImplementedError stubs; full implementation tracked in PR #2.

Tests:
  test/runtime/lora/test_lora_registry.py — 11 unit tests covering
    registration, capacity enforcement, pinning, unregister, scaling.

TODO (tracked in PR):
  - LoraManager: weight loading from safetensors into pre-allocated GPU
    buffers (one buffer per target module × max_lora_rank).
  - Request routing: resolve lora_path → lora_id in scheduler.
  - Batched LoRA matmuls (sgmv / punica kernels or torch fallback).
  - Engine.load/unload implementations calling LoraManager.
  - OpenAI API: expose lora_path in /v1/completions and /v1/chat/completions.
  - C++ scheduler: pass lora_id on requests for prefix-cache namespacing.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…heduler

Implements the correct LoRA prefix cache namespace so:
  • Same adapter + same tokens  → cache hit  ✓
  • Different adapters + same tokens → no cross-adapter hit ✓

Design: per-adapter virtual root node
  For each lora_id > 0, KVPrefixCache::getOrCreateLoraRoot() creates a child
  of the real root keyed by a one-page sentinel token [-lora_id, 0, ..., 0].
  Negative token IDs never appear in real vocabularies (non-negative), so
  there is no collision between adapters or with the base-model namespace.

  An empty DeviceResource is attached to the virtual root so:
    • OnDevice() == true → PruneEmptyByNode never removes it
    • IsLeaf() == false  → eviction never tries to evict it

  KVPrefixCache::Match() and Insert() accept a lora_id parameter (default 0)
  and call resolveStartNode() to obtain the correct namespace root.

  MatchResult::Device::namespace_depth_offset (new field, default 0) is set
  to 1 for LoRA requests and subtracted inside DepthInPage() so all callers
  see the number of real matched token pages, not including the sentinel page.
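
The virtual-root idea can be illustrated with a toy page-granular prefix tree in Python. This is a sketch under stated assumptions, not the C++ implementation: `ToyPrefixCache`, `Node`, and `PAGE_SIZE = 4` are hypothetical, and the toy simply never counts the sentinel page rather than subtracting an offset afterwards.

```python
PAGE_SIZE = 4  # hypothetical page size, for illustration only


class Node:
    def __init__(self):
        self.children = {}  # page (tuple of token ids) -> Node


class ToyPrefixCache:
    def __init__(self):
        self.root = Node()

    def _start_node(self, lora_id, create):
        # lora_id == 0 is the base model and uses the real root directly.
        if lora_id == 0:
            return self.root
        # Sentinel page [-lora_id, 0, ..., 0]; real token ids are non-negative.
        sentinel = (-lora_id,) + (0,) * (PAGE_SIZE - 1)
        if create:
            return self.root.children.setdefault(sentinel, Node())
        return self.root.children.get(sentinel)

    @staticmethod
    def _pages(tokens):
        full = len(tokens) - len(tokens) % PAGE_SIZE
        return [tuple(tokens[i:i + PAGE_SIZE]) for i in range(0, full, PAGE_SIZE)]

    def insert(self, tokens, lora_id=0):
        node = self._start_node(lora_id, create=True)
        for page in self._pages(tokens):
            node = node.children.setdefault(page, Node())

    def match(self, tokens, lora_id=0):
        """Return the number of real token pages matched (sentinel excluded)."""
        node = self._start_node(lora_id, create=False)
        depth = 0
        if node is None:
            return depth
        for page in self._pages(tokens):
            node = node.children.get(page)
            if node is None:
                break
            depth += 1
        return depth
```

Inserting the same tokens under lora_id=1 and then matching under lora_id=2 or lora_id=0 returns 0 pages here, which is the isolation property the commit describes.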

Changes:
  request_spec.h            — add lora_id: int32_t = 0
  request.h / request.cpp   — store + expose LoraId()
  kv_prefix_cache.h/cpp     — getOrCreateLoraRoot, resolveStartNode,
                               lora_id param on Match + Insert
  types.h / types.cpp       — namespace_depth_offset in MatchResult
  forward_events.h/cpp      — FinishEvent carries lora_id_, passes to Insert/Match
  forward.cpp               — pass request->LoraId() to all Match calls
  outside_event_handler.cpp — pass req->LoraId() to FinishEvent
  python_module.cpp         — expose lora_id on Python RequestSpec

Tests (test_lora_prefix_cache.cpp, 6 cases):
  SameAdapterReusesPrefixCache
  DifferentAdaptersDontShareCache
  BaseModelIndependentOfAdapters
  MultipleAdaptersCacheIndependently
  InsertLastNodeIsInAdapterNamespace
  EvictionDoesNotCrossNamespaces

All 120 C++ tests pass.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Three paths were missing lora_id, causing cross-adapter KV cache collisions
when the hybrid (Mamba / HiCache) prefix cache is enabled:

1. HybridPrefixCache::Match() — added lora_id param, passes through to
   KVPrefixCache::Match() so the per-adapter virtual root is used for L2
   host-cache matching as well as device matching.

2. InsertHybridCache() — added lora_id param, passes through to
   KVPrefixCache::Insert() so chunked-prefill inserts land in the correct
   adapter namespace (previously always defaulted to kLoraNone).

3. SchedulePrefillEvent / ScheduleDecodeEvent — added lora_id_ field;
   forward.cpp passes request->LoraId() at construction time.
   Both events call InsertHybridCache() and now supply the adapter id.

Also fixes the schedulePrefillFirstChunk hybrid-path Match call which was
passing lora_id only on the non-hybrid branch.

All 120 C++ tests pass.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…lication

Implements the weight management layer for LoRA adapter serving.

LoraManager (python/tokenspeed/runtime/lora/lora_manager.py)
  Pre-allocates a fixed GPU buffer with max_loras+1 slots (slot 0 = base model).
  load_adapter(name, path): loads PEFT safetensors to CPU, computes scaling
    from adapter_config.json (lora_alpha / r).
  unload_adapter(name): zeroes the GPU slot and frees CPU cache.
  prepare_loras(lora_ids): copies active adapters into GPU slots on demand,
    returns weight_indices [bs] and scalings [n_slots]; evicts LRU non-pinned
    adapters when the pool is full.
  apply_qkv_lora / apply_o_lora: bmm-based delta application, TP-aware
    (column-parallel projections shard B; row-parallel o_proj shards A and
    all_reduces the partial output).

Model integration (qwen3.py)
  Qwen3Attention.forward injects LoRA delta after qkv_proj and o_proj when
  ctx.lora_manager is set.  layer_id stored on Qwen3Attention.

Context / executor (context.py, model_executor.py)
  ForwardContext gains lora_weight_indices, lora_scalings, lora_manager.
  ModelExecutor.execute_forward_op injects LoRA info into ForwardContext when
  any request in the batch carries a non-zero lora_id.

End-to-end routing
  TokenizedGenerateReqInput.lora_id — integer resolved at tokenize time
    from GenerateReqInput.lora_path via InputProcessor._resolve_lora_id().
  make_spec / RequestSpec.lora_id — scheduler receives per-request adapter id.
  EventLoop: init_lora_manager(), load_lora_adapter(), unload_lora_adapter();
    _request_lora_ids dict tracks rid→lora_id for active requests.
  RequestHandler: LoadLoraReqInput / UnloadLoraReqInput dispatch via callbacks.
  scheduler_control_client: load_lora_communicator / unload_lora_communicator
    + async load/unload methods on AsyncLLM.
  Engine.load_lora_adapter / unload_lora_adapter: delegate to tokenizer_manager.

Tested
  PEFT reference on GPU 2: adapter_0 (argon) produces the memorized password
    (Kx7#mP2$-VORTEX93qR-alpha!Z ≈ expected Kx7#mP2$-VORTEX-93qR-alpha!Z).
  tokenspeed serve --enable-lora starts cleanly on GPU 4,5 and serves requests.
  Base model correctly ignores adapters when lora_path is not set.

TODO (PR #2)
  - Route lora_path from OpenAI /v1/completions HTTP body through to lora_id.
  - Full integration test driving greedy output parity with PEFT.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Three fixes were needed to run in eager mode (enforce_eager=True and
disable_pdl=True, both auto-set when --enable-lora is used):

1. server_args: auto-set disable_pdl=True when enable_lora is set.
   The TVM-JIT rmsnorm_cute kernel used by the PDL path is JIT-compiled
   on first call with a fixed dtype; in eager mode the dtype may differ from
   the CUDA-graph warmup call, causing a Mismatched Tensor error.

2. lora_manager: cast scale to the delta tensor's dtype before multiplying.
   bfloat16_delta * float32_scale promoted the result to float32, which the
   rope kernel cannot handle (DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP16 failure).
   Fix: (delta * scale.to(delta.dtype)).

3. qwen3.py: replace _apply_qk_norm kernel calls with a pure-PyTorch
   RMSNorm implementation (_rms_norm static method).  The flashinfer
   rmsnorm_cute kernel is JIT-compiled and its cached dtype cannot be
   changed at runtime; a simple x / rms * weight path avoids the kernel
   entirely and works with any dtype.

Also adds benchmark/test_lora_dynamic.py — end-to-end test demonstrating
dynamic load/unload of two adapters while the engine is live.  Confirmed:
  - load_lora_adapter() / unload_lora_adapter() work at runtime
  - LoRA weights ARE applied (different token IDs at generation position 7+
    vs base model: base→ "The password is", argon adapter → "1789...")
  - Prefix cache namespacing correct (different slots, isolated)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…completions

Exposes lora_path in the OpenAI-compatible HTTP API so clients can select
a LoRA adapter per request without any server restart.

protocol.py
  - CompletionRequest.lora_path: str | None = None
  - ChatCompletionRequest.lora_path: str | None = None

serving_completions.py / serving_chat.py
  - Pass request.lora_path to GenerateReqInput so it flows through
    InputProcessor._resolve_lora_id() → lora_id → scheduler routing.

Usage example:
  curl http://localhost:8000/v1/completions \
    -d '{"model":"Qwen/Qwen3-8B","prompt":"...", "lora_path":"argon","max_tokens":30}'

model_executor.py
  - Fix per-token weight_indices expansion for mixed-adapter batches:
    repeat_interleave(w_idx, input_lengths) so every token in a prefill
    batch gets its request's correct adapter slot index, not just the
    first N requests' indices sliced to total_tokens.
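
For readers unfamiliar with the expansion, here is a pure-Python equivalent of what repeat_interleave does in this fix. It is a sketch only; `expand_per_token` is a hypothetical helper, and the real code operates on tensors.

```python
def expand_per_token(weight_indices, input_lengths):
    """Repeat each request's slot index once per token of that request."""
    out = []
    for slot, n_tokens in zip(weight_indices, input_lengths):
        out.extend([slot] * n_tokens)
    return out


# Two prefill requests: adapter slot 1 with 3 tokens, slot 2 with 2 tokens.
w_idx = [1, 2]
input_lengths = [3, 2]
per_token = expand_per_token(w_idx, input_lengths)
```

The result has one entry per token in the flattened batch, so a token from the second request can no longer pick up the first request's slot.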

lora_manager.py
  - Remove the broken per-token expansion from apply_qkv_lora/apply_o_lora;
    weight_indices is now always already per-token when it arrives.
    Single-request broadcast (1→tokens) is preserved.

benchmark/test_lora_batch.py
  - New test: load argon + bastion, verify each produces different token
    IDs from base model and from each other (adapter isolation proof).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Three correctness/cleanliness fixes to the virtual-root-per-adapter design:

1. Add KVPrefixCache::EvictLoraNamespace(lora_id): DFS-collects all
   descendant nodes, calls ResourceManager::EvictSubtree() to detach
   device/host pages (RAII auto-returns them to the allocator), then
   removes the virtual root via RemoveChild (unique_ptr cascade destroys
   the subtree including any mamba slots). Exposed as
   Scheduler::EvictLoraNamespace and bound to Python as
   scheduler.evict_lora_namespace(lora_id). Called from
   event_loop.unload_lora_adapter() so pages are freed immediately on
   unload rather than waiting for LRU pressure.

2. Remove device_.UpdateLeaves(raw) from getOrCreateLoraRoot: the call
   was a no-op (IsLeaf returns false for the empty-resource virtual root,
   and updateLeaf(real_root) returns immediately on IsRoot check).

3. Add EvictLoraNamespaceFreesPagesImmediately and
   EvictLoraNamespaceIdempotent tests. All 122 C++ tests pass.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Replace the per-token bmm LoRA path with sglang/Punica-style segmented
Triton kernels (sgemm_lora_a / sgemm_lora_b / qkv_lora_b) and refactor
LoraManager around a persistent LoraBatchInfo so the captured CUDA graph
can replay against stable buffer pointers.

* Move LoraManager creation into ModelExecutor.__init__ so graphs are
  captured with the LoRA path baked in (slot 0 = no-adapter, zero-delta
  via rank-0 short-circuit in the kernels).
* Bind ctx.lora_manager during _capture_one and pre-fill batch_info with
  one segment per "request" so all LoRA kernels are recorded.
* qwen3 attention now calls apply_qkv_lora / apply_o_lora with just
  (hidden, qkv, layer_id) — the manager owns batch_info.
* Drop the auto-disable of cuda graphs when --enable-lora is set.
* Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100):
  eager+LoRA 36.7 → graph+LoRA 105.5 tok/s (2.87x).

Also threads lora_path through Engine.generate so the in-process Engine
API matches the HTTP routing that already lands lora_path in
GenerateReqInput.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Commit 126164b reintroduced a manual fp32 RMSNorm in ``_apply_qk_norm``
to dodge a JIT-dtype mismatch in the rmsnorm_cute (PDL) kernel under
``--enable-lora``.  Server args already auto-set ``disable_pdl=True``
for that path, so the regular flashinfer ``rmsnorm`` (used by
input_layernorm / post_attention_layernorm) is correct here too.

Restoring the fused kernel collapses ~7 small launches per call into
one.  Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100):

  * eager + base:   47.7 → 57.4 tok/s  (+20%)
  * graph + base:  122.8 → 142.0 tok/s (+16%)
  * graph + LoRA:  105.5 → 118.8 tok/s (+13%)

Profile (eager): qk_norm dropped from 138 us / layer to 39 us / layer
(36 layers, 4.97 ms → 1.40 ms per decode step).

Aligns this branch with main, which already restored the fused path.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
When --enable-lora is on but no request in the current batch uses an
adapter, the captured CUDA graph still includes all the per-layer Triton
LoRA kernels (rank-0 short-circuit returns early but each kernel still
costs its replay-time launch slot, roughly 5% per step).

Capture two graphs per batch size:
* graphs[bs]          — with-LoRA: ctx.lora_manager set, Triton calls baked in.
* graphs_no_lora[bs]  — same forward without the LoRA path.

LoraManager.prepare_loras updates a CPU-side has_active_lora flag from
the resolved per-request slots; the wrapper reads it before each replay
to pick the right variant.  Mixed batches (any segment with rank > 0)
fall back to the with-LoRA graph as before.
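
The variant selection can be sketched as below. `ToyGraphRunner` is a hypothetical wrapper, strings stand in for captured CUDA graph objects, and computing the flag inside `pick` (rather than in prepare_loras) is a simplification for illustration.

```python
class ToyGraphRunner:
    """Hypothetical wrapper holding two captured variants per batch size."""

    def __init__(self, batch_sizes):
        # Strings stand in for captured CUDA graph objects.
        self.graphs = {bs: f"lora_graph[{bs}]" for bs in batch_sizes}
        self.graphs_no_lora = {bs: f"base_graph[{bs}]" for bs in batch_sizes}

    def pick(self, bs, slots):
        # Slot 0 means "no adapter"; any non-zero slot forces the LoRA graph.
        has_active_lora = any(s != 0 for s in slots)
        table = self.graphs if has_active_lora else self.graphs_no_lora
        return table[bs]
```

Keeping the flag on the CPU side means the choice costs no device synchronization before replay, which appears to be the intent of the design above.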

Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100):

  * graph + no --enable-lora              : 142.0 tok/s
  * graph + --enable-lora, no adapter     : 134.5 → 138.4 tok/s
  * graph + --enable-lora, active adapter : 119.1 tok/s (unchanged)

Tradeoffs: 2× capture time at startup (~10s → ~20s); marginal extra
graph memory (the activations pool is shared via global_graph_memory_pool).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Extends LoRA to the MLP block of qwen3 in addition to attention.

Triton kernels:
* New gate_up_lora_b — fused 2-projection B expand for the stacked
  gate/up MLP linear (analogous to qkv_lora_b for attention).
* Reuses sgemm_lora_a (stack_num=2 for gate_up, 1 for down) and
  sgemm_lora_b (for down's full output expand).

LoraManager:
* _parse_adapter_weights now matches mlp.{gate,up,down}_proj keys.
* New per-layer buffers gate_up_A/B and down_A/B; un-sharded because
  qwen3 Qwen3MLP runs MergedColumnParallelLinear / RowParallelLinear
  with tp_size=1 (each rank holds the full intermediate weight).
* New apply_gate_up_lora and apply_down_lora — gate_up reuses the
  fused-B path; down has no internal all-reduce because there's no TP.

Bug fix (also affected attention):
* The sgemm_lora_a kernel only writes the first ``rank * stack_num``
  output cols, and qkv_lora_b / gate_up_lora_b read with stride
  ``stack_idx * actual_rank`` (after the kernel's K=min(K,rank) cap).
  _load_to_slot was packing stacks at multiples of MAX rank, which fell
  outside what the kernels actually read — silently zeroing the k/v
  deltas (and now would zero up's delta too).  Now packs stacks
  contiguously at ``stack_idx * actual_rank``, matching what sglang's
  weight loader does (mem_pool.py L873 ``[:lora_rank * c, :]``).
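
The offset mismatch is easy to see with row ranges alone. The helpers below are hypothetical and only model where each stacked projection is written versus where the kernels read; they do not touch real buffers.

```python
def pack_offsets(stack_num, actual_rank, max_rank, contiguous):
    """Row ranges in the A buffer where each stacked projection is written."""
    stride = actual_rank if contiguous else max_rank
    return [(s * stride, s * stride + actual_rank) for s in range(stack_num)]


def kernel_read_ranges(stack_num, actual_rank):
    """Row ranges the kernels read: stack s starts at s * actual_rank."""
    return [(s * actual_rank, (s + 1) * actual_rank) for s in range(stack_num)]


# q/k/v stacked (stack_num=3), adapter rank 16 in a pool sized for rank 64.
reads = kernel_read_ranges(3, 16)
old = pack_offsets(3, 16, 64, contiguous=False)
new = pack_offsets(3, 16, 64, contiguous=True)
```

With the old packing only stack 0 (q) lands where the kernel reads, which matches the "only-q" behaviour noted in the throughput section of this commit; the two layouts coincide only when the adapter rank equals the maximum rank.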

Qwen3MLP gains a layer_id and the forward call now threads through
``ctx`` so the LoRA hooks can be invoked.

E2E correctness on togethercomputer/Qwen3-8B-LoRA-Password-Adapters
(Qwen3-8B, TP=1, bs=1, H100):

* attn adapter: ' No other text.\nX7#mP2$VORTEX93qR\n...'
                (PEFT ref: 'Zx7#mP2$-VORTEX93qR\nNext, please ...')
* mlp adapter:  ' 73\nKx7#mP2$-VORTEX-93qR\nKx7#mP2$'
                (PEFT ref: ' 73\nKx7#mP2$-VORTEX-93qR\nKx7#mP2$-...')
                — bit-for-bit match for the first ~30 tokens.

Throughput (256 decode tokens):
  * graph + base                     : 142.0 tok/s
  * graph + attn LoRA  (q/k/v/o)     : 119.1 tok/s (post-stack-fix; was
    only-q before, so this is the *correct* number)
  * graph + mlp LoRA   (gate/up/down): 97.5 tok/s
  * sglang/tgl mlp LoRA: crashes with cudaErrorIllegalAddress on both
    csgmv and triton backends.

Memory: MLP buffers add ~672 MB at ``max_loras=2`` for Qwen3-8B
(intermediate=12288, hidden=4096, max_rank=64).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Batched ``engine.generate(prompt=[...], lora_path=[...])`` is split per
index by ``async_llm._handle_batch_request`` via ``obj[i]``.  The
``__getitem__`` method built the per-request sub-object but dropped
``lora_path``, so every sub-request ran as base model regardless of
which adapter the caller asked for.
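
A trimmed-down illustration of the fix, assuming a hypothetical `ToyGenerateReq` with only two fields (the real GenerateReqInput carries many more, and its handling of scalar versus list lora_path may differ):

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ToyGenerateReq:
    # Hypothetical, trimmed-down stand-in for GenerateReqInput.
    prompt: Union[str, List[str]]
    lora_path: Optional[Union[str, List[Optional[str]]]] = None

    def __getitem__(self, i: int) -> "ToyGenerateReq":
        # A scalar lora_path applies to every row; a list is indexed per row.
        if isinstance(self.lora_path, list):
            lora = self.lora_path[i]
        else:
            lora = self.lora_path
        return ToyGenerateReq(prompt=self.prompt[i], lora_path=lora)


batch = ToyGenerateReq(prompt=["p0", "p1"], lora_path=["argon", None])
rows = [batch[i] for i in range(2)]
```

Omitting the `lora_path=lora` argument in `__getitem__` reproduces the bug: every sub-request falls back to the default of None and runs as the base model.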

Mixed-batch test on togethercomputer/Qwen3-8B-LoRA-Password-Adapters
(4 adapters + 1 base prompt in a single ``generate`` call):
  * before: 1/5 — only the base-model row passed; all four adapter
    rows produced base-model output.
  * after:  4/5 — three adapter rows emit their project's password
    fragment, base row correctly does not.  The remaining failure is
    a flaky adapter (bastion is just noisy under greedy decode — same
    behavior in isolation), not a routing bug.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Adds a CPU pinned-memory tier between the GPU LoRA buffers and the
adapter's disk path.  Adapters now flow:

    disk (always)  →  CPU pool (max_loras_cpu)  →  GPU pool (max_loras)

* CPU pool is bounded; LRU eviction drops the cached parsed weights and
  relies on _adapter_paths[name] to reload on next use.  The disk path
  is the source of truth and is assumed durable (S3 backing is a
  natural future replacement).

* Pinned adapters (passed `pinned=True` at load time) are protected
  from CPU eviction; non-pinned GPU-resident adapters can be CPU-evicted
  when the pool is otherwise full (their weights are still on GPU; a
  future GPU re-promotion costs a disk read).  Eviction prefers
  non-GPU-resident candidates first.

* Async prefetch hooks request admission: when a request with
  ``lora_id != 0`` is admitted, the manager kicks off a disk read on a
  ThreadPoolExecutor so the safetensors I/O is overlapped with the
  previous forward step instead of blocking ``prepare_loras`` of the
  step that consumes it.  prepare_loras joins an in-flight prefetch
  instead of double-reading.  Toggle with ``TOKENSPEED_LORA_PREFETCH=0``.

* New server args:
    --max-loras-cpu             default 4 × max_loras
    --lora-scheduling-policy    {lru} for now; the dispatch point
                                stays in event_loop for future
                                'admission' / 'pack' policies.

* Validation: max_loras_cpu must be ≥ max_loras (every GPU-resident
  adapter is also tracked in the CPU LRU; if max_loras_cpu == max_loras
  the policy-2 step lets us evict GPU-resident adapters from CPU when
  needed, instead of locking the pool).
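
The CPU-tier eviction order described in the bullets above can be sketched as a victim-selection function. `pick_cpu_victim` and its arguments are hypothetical; the sketch assumes the CPU LRU is kept as an OrderedDict ordered from least to most recently used.

```python
from collections import OrderedDict


def pick_cpu_victim(cpu_lru, pinned, gpu_resident):
    """Choose an adapter to drop from the CPU pool, or None if all are pinned.

    cpu_lru is an OrderedDict ordered from least to most recently used.
    """
    candidates = [name for name in cpu_lru if name not in pinned]
    # First choice: least recently used adapter that is not on the GPU.
    for name in candidates:
        if name not in gpu_resident:
            return name
    # Fallback: a GPU-resident adapter. Its GPU copy stays usable, but a
    # later re-promotion after GPU eviction needs a disk read.
    return candidates[0] if candidates else None


cpu_lru = OrderedDict.fromkeys(["argon", "bastion", "citadel"])
victim = pick_cpu_victim(cpu_lru, pinned={"citadel"}, gpu_resident={"argon"})
```

Here argon is the least recently used entry but is skipped because it is GPU-resident, so bastion is chosen instead.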

E2E test (Qwen3-8B, max_loras=2, max_loras_cpu=2, three adapters
sequenced so the first is CPU-evicted then re-requested):
  * 1st argon: ' Kx7#mP2$-VORTEX93qR' → PASS  (initial)
  * 1st citadel: 'Tf3!hR6^-PRISM-27bK' → PASS
  * dagger: HELIX-fragments → noisy under greedy decode
  * 2nd argon (after CPU eviction + disk reload):
      ' Zx7#mP2$-VORTEX93qR' → PASS, matches the PEFT reference.

29 unit tests pass (incl. 8 new tests covering CPU LRU, disk reload,
pinned protection, prefetch path, and unload tear-down).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Adds the ``pack`` lora scheduling policy and a benchmark that
characterises the cost of each residence tier so users can size
``--max-loras-cpu`` for their workload.

Benchmark (Qwen3-8B, TP=1, max_loras=2, max_loras_cpu=3, max_lora_rank=64,
H100 80GB, 1-token decode):

    warm:          ~43 ms
    cpu-resident:  ~43 ms   (CPU→GPU copy is <1 ms, lost in the forward)
    cold (disk):   ~72 ms   (~30 ms safetensors read + parse)

Findings:
* CPU promotion is essentially free, so once an adapter is in the CPU
  pool there is no measurable per-request cost.  Sizing ``max_loras_cpu``
  to cover the working set eliminates the cold-disk hit entirely.
* Async prefetch only matters under multi-request concurrency: in
  serial single-request mode the prefetch's disk read still blocks the
  consuming request's prepare_loras.
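
The "join an in-flight prefetch instead of double-reading" behaviour might look roughly like this. `ToyPrefetcher` is a hypothetical helper, and the sketch assumes `prefetch` and `get` are both called from the same event-loop thread, so the in-flight dict needs no lock.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyPrefetcher:
    """Hypothetical helper: at most one disk read per adapter at a time."""

    def __init__(self, load_fn, workers=2):
        self._load_fn = load_fn          # e.g. a safetensors read + parse
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._inflight = {}              # adapter name -> Future

    def prefetch(self, name):
        # Called at request admission; a second call reuses the same Future.
        if name not in self._inflight:
            self._inflight[name] = self._pool.submit(self._load_fn, name)

    def get(self, name):
        # Called from the consuming step: join the in-flight read if present,
        # otherwise fall back to a blocking load.
        fut = self._inflight.pop(name, None)
        return fut.result() if fut is not None else self._load_fn(name)
```

In serial single-request mode `get` is called almost immediately after `prefetch`, so `fut.result()` blocks for nearly the full read, which is consistent with the finding above that prefetch only helps under concurrency.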

``pack`` policy: in ``_process_new_requests`` the admitted-spec list is
stable-sorted by lora_id when ``--lora-scheduling-policy=pack``, so
adapter-shared requests cluster at the C++ scheduler.  Reduces GPU/CPU
eviction churn when ``working_set > max_loras_cpu`` and traffic is
bursty enough to put multiple cold requests in one event-loop iter.
``lru`` (default) keeps arrival order.
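
A minimal sketch of the ordering step, assuming admitted specs are modelled as (request_id, lora_id) tuples; `order_admitted` is a hypothetical name, not the function in event_loop.

```python
def order_admitted(specs, policy):
    """specs: list of (request_id, lora_id) in arrival order."""
    if policy == "pack":
        # sorted() is stable, so requests sharing an adapter keep arrival order.
        return sorted(specs, key=lambda spec: spec[1])
    return list(specs)  # "lru": arrival order unchanged


arrivals = [("r0", 2), ("r1", 1), ("r2", 2), ("r3", 0), ("r4", 1)]
packed = order_admitted(arrivals, "pack")
```

Because the sort is stable, fairness within an adapter is preserved; only the interleaving between adapters changes.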

Skipped the ``admission`` policy: the benchmark shows GPU promotion is
free, so gating batches that don't fit in GPU buys nothing — the only
real eviction cost is CPU→disk, and that is already controlled by
``max_loras_cpu``.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…erving

# Conflicts:
#	python/tokenspeed/runtime/execution/model_executor.py
#	python/tokenspeed/runtime/models/qwen3.py
#	tokenspeed-scheduler/CMakeLists.txt
#	tokenspeed-scheduler/bindings/python_module.cpp
#	tokenspeed-scheduler/csrc/fsm/forward_events.cpp
…erving

Signed-off-by: Qingyang Wu <qingyang@together.ai>

# Conflicts:
#	python/tokenspeed/runtime/engine/io_struct.py
#	python/tokenspeed/runtime/entrypoints/openai/protocol.py
#	python/tokenspeed/runtime/entrypoints/openai/serving_chat.py
#	python/tokenspeed/runtime/entrypoints/openai/serving_completions.py
#	tokenspeed-scheduler/CMakeLists.txt
#	tokenspeed-scheduler/csrc/resource/kv_prefix_cache/kv_prefix_cache.cpp
#	tokenspeed-scheduler/csrc/resource/kv_prefix_cache/kv_prefix_cache.h
EvictSubtree referenced the old `leaves_` set removed by lightseekorg#18; switch to
the timestamp-keyed lru_leaves_/node_time_ cleanup used by updateLeaf so
the scheduler core compiles again and pip's editable build of
tokenspeed-scheduler succeeds.

Also apply clang-format 18.1.3 to files touched by the LoRA merge so the
lint job passes.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…erving

Resolved conflicts in KV/Hybrid prefix cache Match signatures by composing
both new params: lora_id (this branch, per-adapter namespacing) and intent
(main, distinguishes PrefixReuse from StateRecovery for retracted-request
recovery). Both call sites in forward.cpp (scheduleDecodeFromRetracted and
the post-allocation re-match) now pass request->LoraId() together with
MatchIntent::StateRecovery so retracted LoRA requests recover from their
own adapter namespace.

Also merged ForwardContext: kept the new last_index_offsets field from
main alongside the lora_manager field on this branch.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Per AGENTS.md the runtime should only cross the kernel boundary through
tokenspeed-kernel, and Triton imports should funnel through _triton.py.
Relocates the segment-grouped LoRA kernels from
python/tokenspeed/runtime/lora/triton_ops/ to
tokenspeed-kernel/python/tokenspeed_kernel/ops/gemm/lora_triton/ and
swaps the `import triton` lines for `from tokenspeed_kernel._triton`.
LoraManager now imports its kernels from the kernel package.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Two TP-correctness fixes uncovered when verifying the
Qwen3-8B-LoRA-Password-Adapters e2e suite at attn_tp_size=2.

1. Qwen3MLP is now TP-aware (gate_up_proj column-parallel, down_proj
   row-parallel; see runtime/models/qwen3.py). The LoRA buffers and
   slice offsets assumed the un-sharded layout, causing a shape mismatch
   in sgemm_lora_a during CUDA-graph capture and incorrect adapter
   semantics if the assert had not fired. The fix introduces
   intermediate_per_tp and:
     - sizes gate_up_B_buffers to (2 * intermediate_per_tp, r) per slot,
     - sizes down_A_buffers to (r, intermediate_per_tp) per slot,
     - passes intermediate_per_tp to gate_up_lora_b_fwd (the kernel
       already expected the per-rank output dim),
     - extends _shard_weights to slice MLP B (gate/up, column) and MLP
       A (down, row) the same way attention modules already were.

2. apply_o_lora previously computed the *full* B @ A @ x by all-reducing
   lora_a internally, then added that full delta to a partial base
   output. The host's downstream all-reduce in post_attention_layernorm
   then summed the delta tp_size times — pre-existing bug acknowledged
   in the old docstring, manifesting as garbled output for any attention
   adapter at TP > 1. Drop the internal all-reduce so each rank emits a
   partial (B @ A_local @ x_local) and rely on the existing downstream
   all-reduce to sum partials correctly; comm_all_reduce import is no
   longer needed.
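
The double counting in fix 2 can be reproduced with plain arithmetic. The numbers below are made up for illustration: a rank-1 LoRA on a 4-wide row-parallel projection split over two TP ranks, with Python `sum` standing in for the all-reduce.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Rank-1 LoRA on a 4-wide row-parallel projection, split over two TP ranks.
x = [1.0, 2.0, 3.0, 4.0]
W = [0.5, 0.5, 0.5, 0.5]   # base weight row
A = [1.0, 0.0, 1.0, 0.0]   # LoRA A (1 x 4)
B = 2.0                    # LoRA B (1 x 1)
shards = [slice(0, 2), slice(2, 4)]
tp_size = len(shards)

reference = dot(W, x) + B * dot(A, x)  # unsharded result

# Fixed path: each rank emits base + B @ A_local @ x_local, and a single
# downstream all-reduce sums the partials.
partials = [dot(W[s], x[s]) + B * dot(A[s], x[s]) for s in shards]
fixed = sum(partials)

# Old path: lora_a is all-reduced first, so every rank adds the full delta,
# and the downstream all-reduce then counts it tp_size times.
full_a = sum(dot(A[s], x[s]) for s in shards)
buggy = sum(dot(W[s], x[s]) + B * full_a for s in shards)
```

With these values the reference and fixed outputs are both 13.0, while the old path yields 21.0, i.e. the base output plus tp_size times the LoRA delta.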

Verified e2e against Qwen3-8B with attention and MLP adapters from
togethercomputer/Qwen3-8B-LoRA-Password-Adapters at attn_tp_size=2:
both modes produce the exact target passwords; base model does not
leak the secret; same-adapter re-queries after a different adapter is
loaded still resolve through the right namespace.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Adds ``@triton.autotune`` to all four LoRA kernels
(``sgemm_lora_a``, ``sgemm_lora_b``, ``qkv_lora_b``, ``gate_up_lora_b``),
keyed on the (output_dim, K) shape pair that drives tile selection.  The
candidate config sweep matches the space sglang found productive in
sgl-project/sglang#20391 (shrink: BLOCK_N×BLOCK_K×warps×stages; expand:
adds maxnreg for occupancy) plus a BLOCK_S axis since our kernel
exposes it.

Picks survive process restarts via ``configs/<gpu>/<kernel>.json``
checked into the package — on import ``load_kernel_cache`` populates
``Autotuner.cache`` so production never pays the sweep cost.  The
``tune.py`` driver runs each kernel with decode-shaped batches
(``bs=32, max_len=1``) for the Qwen3-8B shapes at attn_tp_size=2 and
writes the JSON; re-run it on a new GPU or model to extend the cache.

Bench on the lora_active config (Qwen3-8B, attn_tp=2, 32 prompts ×
128 out tokens, password adapter on every request):

    base                              5517 tok/s   23.2 ms/req
    --enable-lora, no lora_path       5210 tok/s   24.6 ms/req
    --enable-lora, lora_path (orig)   3201 tok/s   40.0 ms/req
    --enable-lora, lora_path (tuned)  3279 tok/s   39.0 ms/req   (+2.4%)

A modest win — the workload is decode-dominated (bs=32 single-token
segments), where launch overhead and per-step ``prepare_loras`` work
dwarf the block-size choice for these small matmuls.  Tuning at
prefill-shaped batches (bs=4, max_len=32) regressed by ~5%, confirming
that the block sizes are decode-vs-prefill sensitive; the committed
configs target decode.  Larger wins are still possible against the
non-kernel parts of the LoRA path (per-step host work, kernel launch
count) but those are out of scope here.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
``sgemm_lora_a``/``sgemm_lora_b`` were misleading on two axes: ``sgemm``
is BLAS for "single-precision (fp32) GEMM" (our kernel is bf16/fp16),
and ``_a``/``_b`` is PEFT terminology that's only obvious to LoRA
specialists.  Replace with operation-name files that read at first
glance:

  sgemm_lora_a.py    -> lora_shrink.py          (in_dim -> r)
  sgemm_lora_b.py    -> lora_expand.py          (r -> out_dim)
  qkv_lora_b.py      -> lora_qkv_expand.py      (fused QKV expand)
  gate_up_lora_b.py  -> lora_gate_up_expand.py  (fused gate/up expand)

Public ``*_fwd`` functions, internal ``_*_kernel`` symbols, and the
per-GPU autotune JSON config filenames follow the same scheme.  The
PEFT-style attribute names inside ``lora_manager.py``
(``qkv_A_buffers``, ``o_B_buffers``, etc.) and the tensor-parameter
names in the kernel signatures (``qkv_lora_b``, ``gate_up_lora_b``)
stay — those legitimately reference the PEFT ``lora_A``/``lora_B``
decomposition, not the operation.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
LoRA isn't really a GEMM variant — it's its own op family that happens
to use segmented matmuls under the hood.  Hosting the kernels under
``ops/gemm/lora_triton/`` overloaded the gemm family with LoRA-specific
buffers, batch_info, and Triton helpers.

Promote LoRA to a top-level family that follows the ``<family>/<solution>``
convention already used by ``ops/attention/triton/``:

  ops/gemm/lora_triton/  →  ops/lora/triton/

The kernel files, autotune configs, ``tuning.py`` cache loader, and
``tune.py`` driver all move together; only the import path changes.
``lora_manager.py`` in the runtime is updated to import from the new
location.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
The four LoRA Triton kernels (and ``kernel_utils.py``) were adapted from
sglang's ``python/sglang/srt/lora/triton_ops/`` (Apache-2.0), which in
turn descends from the Punica S-LoRA design.  Add file-level provenance
notes — upstream path, URL, license — and a package-level pointer in
``__init__.py``.  No code changes; attribution only.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Follow-up to the ops/lora/triton/ restructure — update the runtime
manager to import from the new location instead of ops/gemm/lora_triton.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Add chunked_sgmv_expand_fwd — a unified LoRA-B expand kernel that covers
plain, QKV, and gate/up projections via a NUM_SLICES constexpr and a
slice_offsets boundary tensor.  Making OUTPUT_DIM, MAX_RANK, NUM_SLICES,
and all strides constexpr lets the compiler specialise the K-loop trip
count at compile time, giving 2–3× speedup at prefill with rank ≥ 64
vs the runtime-stride decode kernels.

lora_manager dispatches on batch_info.max_len > 32: decode steps always
use the existing tuned kernels (11–25 µs); prefill uses chunked_sgmv.
Slice-offset tensors for each projection type are pre-allocated in
__init__ so dispatch adds zero per-step overhead, and the captured decode
CUDA graph is unaffected (max_len = 1 is always below the threshold).

Benchmarked on H100 at Qwen3-8B TP=2 shapes:
  prefill s=512 rank=64 QKV expand:  62 µs → 19 µs  (3.3×)
  prefill s=512 rank=64 gate/up:    110 µs → 35 µs  (3.1×)
  decode  s=1   rank=64 (unchanged): 34 µs

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu added 30 commits May 19, 2026 18:09
Add lora_expand_grouped_v2_fwd: adapts vLLM's token-sorted dispatch
pattern (grid axis-1 = num_active_adapters) to eliminate the
gather/scatter overhead of lora_expand_decode_fwd.

Key design:
  • x and output accessed at scattered original token positions via
    token_indices — no pre-gather or post-scatter needed
  • Grid: (cdiv(M, BLOCK_S) × cdiv(N, BLOCK_N), num_groups)
    — tiles both M and N, matching vLLM's parallelism structure
  • CTAs beyond a group's token count exit immediately (same early-exit
    as vLLM's lora_expand_kernel)
  • Constexpr strides + tl.multiple_of EVEN_K from our prior work

Benchmarked vs vLLM inline + old grouped kernel (rank=64, N=4096, H100):
  n= 32  n_unique=4:  grpv2=  9.8µ  vllm=11.3µ  seg=22.2µ  (+12% vs vllm)
  n= 64  n_unique=4:  grpv2= 10.4µ  vllm=12.1µ  seg=36.2µ  (+14% vs vllm)
  n=128  n_unique=4:  grpv2= 12.7µ  vllm=13.2µ  seg=63.8µ  (+ 4% vs vllm)
  n=128  n_unique=1:  grpv2= 11.0µ  vllm=11.0µ  seg=62.9µ  (tied)

grpv2 wins in the common n_unique ≤ n/4 regime; vllm wins marginally
at extreme n_unique=n (all unique) corner cases, which the existing
dispatch threshold (bs // num_groups >= 8) already routes to segmented.

Replaces lora_expand_decode_fwd at both dispatch sites in lora_manager.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…iscompute

When the autotuner benchmarks BLOCK_K=64 for MAX_RANK=16, the trip count
K // BLOCK_K evaluates to 0, so the loop never runs and the kernel is a
silent no-op (correct base_output returned but LoRA delta omitted).  The
autotuner then picked this config as 'fastest' because it did no work.

Fix: revert K // BLOCK_K -> tl.cdiv(K, BLOCK_K) and restore k_rem masks
so all BLOCK_K configs produce correct results.  Configs with BLOCK_K > K
are now slower (one masked iteration) and the autotuner naturally avoids
them in favour of BLOCK_K <= rank configs.

Verified: 176/176 correctness checks pass across n in {1..128},
n_unique in {1..n}, rank in {16,32,64,128}, N in {4096,8192}.
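
The failure mode can be reproduced in plain Python. The sketch below is only an analogue of the kernel's K-loop: ``blocked_dot`` is a hypothetical helper, and the ``idx < k`` check stands in for the k_rem mask.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the integer analogue of tl.cdiv."""
    return (a + b - 1) // b


def blocked_dot(x, w, block_k: int, use_cdiv: bool) -> float:
    """Dot product of x and w accumulated in BLOCK_K-sized chunks."""
    k = len(x)
    num_iters = cdiv(k, block_k) if use_cdiv else k // block_k
    acc = 0.0
    for it in range(num_iters):
        for j in range(block_k):
            idx = it * block_k + j
            if idx < k:  # plays the role of the k_rem mask
                acc += x[idx] * w[idx]
    return acc


x = [1.0] * 16  # K = rank = 16
w = [0.5] * 16
floor_result = blocked_dot(x, w, block_k=64, use_cdiv=False)
cdiv_result = blocked_dot(x, w, block_k=64, use_cdiv=True)
print(floor_result, cdiv_result)
```

With floor division the loop body never executes for BLOCK_K > K, so the accumulator stays at zero; with ceiling division one masked iteration covers the whole rank.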

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

# Conflicts:
#	python/tokenspeed/runtime/models/qwen3.py
Summary of changes in this commit:

lora_expand_grouped_v2.py (correctness fix):
  Restore tl.cdiv(K, BLOCK_K) + k-masks in place of K // BLOCK_K, preventing
  the autotuner from selecting BLOCK_K > rank configs which silently
  produced zero-delta outputs.  Verified 176/176 correctness checks pass
  across n ∈ {1..128}, n_unique ∈ {1..n}, rank ∈ {16,32,64,128},
  N ∈ {4096,8192}.

lora_manager.py:
  Switch o_proj and down_proj decode dispatch from lora_expand_decode_fwd
  (gather/scatter) to lora_expand_grouped_v2_fwd (scattered reads, no copy).
  Add adapter-group metadata (sort_order, group_slots, group_starts,
  group_sizes, num_groups) to prepare_loras for the new kernel.

lora_expand.py / lora_qkv_expand.py / lora_gate_up_expand.py:
  Add BLOCK_K ∈ {64, 128} to expand config spaces (profiling showed
  0% BW utilisation — instruction-bound; BLOCK_K=64 eliminates the
  K-loop for rank=64 when combined with tl.cdiv).

bench_vs_vllm.py, profile_expand.py:
  Benchmark and profiling scripts comparing vs vLLM kernels.

End-to-end numbers (H100, rank=64):
  Decode n=32  expand, original → grpv2:    25.1 µs → 11.2 µs  (2.24×)
  Decode n=128 expand, original → grpv2:    63.0 µs → 14.2 µs  (4.45×)
  Prefill s=512 QKV expand, original → new: 61.0 µs → 28.8 µs  (2.12×)
  Prefill s=512 shrink, original → new:     23.4 µs → 16.7 µs  (1.40×)
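
One way the adapter-group metadata named under lora_manager.py could be derived from per-token adapter slots is sketched below. The field names come from the commit message, but their exact semantics here (a stable sort of token positions by slot, then one run per slot) are an assumption, and ``build_group_metadata`` is a hypothetical helper that runs on the CPU.

```python
def build_group_metadata(token_slots):
    """Group token positions by adapter slot.

    token_slots[i] is the adapter slot used by token i (assumed input).
    """
    # Stable sort keeps tokens of the same adapter in their original order.
    sort_order = sorted(range(len(token_slots)), key=lambda i: token_slots[i])

    group_slots, group_starts, group_sizes = [], [], []
    for pos, tok in enumerate(sort_order):
        slot = token_slots[tok]
        if not group_slots or group_slots[-1] != slot:
            # A new run of tokens sharing one adapter begins at `pos`.
            group_slots.append(slot)
            group_starts.append(pos)
            group_sizes.append(0)
        group_sizes[-1] += 1

    return {
        "sort_order": sort_order,
        "group_slots": group_slots,
        "group_starts": group_starts,
        "group_sizes": group_sizes,
        "num_groups": len(group_slots),
    }


meta = build_group_metadata([2, 0, 2, 1, 0])
print(meta)
```

Under this reading, a kernel can read tokens in place through sort_order instead of gathering them into a contiguous buffer first.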

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…eshold needed

The grouped v2 kernel previously used M = batch_info.bs (total tokens)
for the grid M dimension.  For n_unique = n (all different adapters) this
launched cdiv(n, BLOCK_S) × cdiv(N, BLOCK_N) × n CTAs with (BLOCK_S-1)/BLOCK_S
wasted per group, making it 2-3× slower than segmented.

Fix: use M = max(group_sizes) (pre-computed on CPU, no GPU sync) instead of
batch_info.bs.  When every group has 1 token (seg-like case), max_group_size=1
→ grid = (1 × cdiv(N,BLOCK_N), n) — identical to the segmented layout with
zero wasted CTAs.  The kernel now handles both extremes:

  n_unique = 1  (same adapter):   max_gs=n  → grpv2 layout, full M-tiling
  n_unique = n  (all different):  max_gs=1  → segmented layout, no waste
  n_unique = 4  (typical):        max_gs=n/4 → compact 4× fewer CTAs

Removes the _TRITON_GROUPED_DECODE_MIN_GROUP_SIZE = 32 threshold (set to 1)
since the kernel is now safe and optimal for all group sizes.
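
The effect on the launch grid can be checked with simple arithmetic. In the sketch below the grid formula is the one quoted in these commits, while BLOCK_S=16, BLOCK_N=64, and evenly sized groups are assumptions chosen for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)


def grid_ctas(m: int, n: int, num_groups: int, block_s: int, block_n: int) -> int:
    """CTA count for a grid of (cdiv(M, BLOCK_S) * cdiv(N, BLOCK_N), num_groups)."""
    return cdiv(m, block_s) * cdiv(n, block_n) * num_groups


N_TOKENS, N_OUT = 128, 4096   # batch size and output dim from the benchmark
BLOCK_S, BLOCK_N = 16, 64     # assumed tile sizes, not taken from the PR

results = {}
for n_unique in (1, 4, 128):
    max_group_size = N_TOKENS // n_unique  # assumes evenly sized groups
    old = grid_ctas(N_TOKENS, N_OUT, n_unique, BLOCK_S, BLOCK_N)
    new = grid_ctas(max_group_size, N_OUT, n_unique, BLOCK_S, BLOCK_N)
    results[n_unique] = (old, new)
    print(f"n_unique={n_unique:3d}  M=bs: {old:6d} CTAs  M=max_gs: {new:6d} CTAs")
```

With these assumed tile sizes the CTA count no longer grows with the total batch size once tokens are spread across many adapters.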

Results (rank=64, N=4096):
  n=128 n_uniq=128 (seg-like):  grpv2≈seg  75.5µ vs 76.9µ  (1.02×)
  n=128 n_uniq=  4 (typical):   grpv2 wins 15.0µ vs 63.3µ  (4.23×)
  n=128 n_uniq= 32:              grpv2 wins 27.4µ vs 66.8µ  (2.44×)

Also adds max_group_size: int to LoraBatchInfo and sets it in prepare_loras.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
lora_expand_decode_fwd (gather/scatter grouped expand) was fully
replaced by lora_expand_grouped_v2_fwd and is no longer referenced
anywhere in production code.  Remove the file and clean up the
__init__ export and doc references.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…perf fixes

Core changes:
- adapter_io: parse PEFT lora_embedding_A/B keys for lm_head; add LORA_HEAD_LAYER_ID sentinel
- lora_buffers: add 'lm_head' buffer group (lm_head_A/B_buffer, vocab_per_tp dim); column-parallel TP sharding
- lora_manager: add apply_lm_head_lora (single-slot matmul fast path; bmm fallback for mixed slots); skip H2D copies + cumsum in prepare_loras when has_active_lora=False
- logits_processor: wire apply_lm_head_lora before TP all-gather in _get_logits
- moe/layer: raise NotImplementedError for non-Triton backends with active LoRA
- server_args: add 'lm_head' to valid lora_buffer_groups; remove stale disable_pdl=True override (PDL works correctly with LoRA)

Tests: test_qwen3_lora_password_adapters covers attn/mlp/lm_head adapter types under sequential, batched, high-concurrency, and mixed-batch scenarios (72 subtests)

Perf: n_active=0 cudagraph now matches baseline (1170 vs 1171 tok/s) after removing two wasted GPU ops per step and re-enabling PDL

Signed-off-by: Qingyang Wu <qingyang@together.ai>
…totune picks

- Add BLOCK_S=8 to _EXPAND_CONFIGS: decode batches have S=8 tokens/segment,
  so BLOCK_S=16 wastes half the tile. The autotuner now considers the
  decode-optimal tile size.
- Cache autotune picks for N=6144 (QKV expand, q+2kv=4096+1024+1024) and
  N=24576 (gate_up expand, 2×12288) on H100 80GB HBM3. Both shapes were
  previously missing, triggering a live 648-config sweep on every fresh
  process. New picks consistently use BLOCK_S=8 for decode workloads.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
…H100

Add BLOCK_S=8 to _SHRINK_CONFIGS so the autotuner considers decode-batch
tile sizes. Re-ran autotune for all 16 Qwen3-8B shapes (rank 16/32/64/128
× K=4096/12288, QKV/gate_up stacks). Unlike the expand kernel, the shrink
kernel is K-bandwidth-bound (large hidden_size read), so BLOCK_S=16 remains
optimal — amortising the K-dimension read across more output rows wins.
No config changes from re-tuning; cache updated to reflect the wider search.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
…mark

- test_qwen3_moe_per_expert_lora_password_adapters: tests per_expert format
  (128 experts × 48 layers, 36864 weight tensors) with max_loras=2 due to
  ~1.96 GB/slot buffer footprint on Qwen3-30B-A3B. Covers sequential, 2-adapter
  batch, high-concurrency same-adapter, and mixed LoRA/base scenarios.
- bench_moe_lora_decode: compares per_expert vs sglang_shared formats across
  n_active=0/1/2/4/8 vs no-LoRA baseline. Results in 0521_moe_lora_results.md.
  Key finding: both formats ~83 tok/s at n_active=1 (−63% vs 226 tok/s baseline);
  n_active>2 collapses to ~2 tok/s due to CPU↔GPU swap thrashing with max_loras=2.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
…ton baseline

- bench_moe_lora_decode.py: parallel benchmark using mp.Process across 8 GPUs,
  batched 8-at-a-time. Tests per_expert vs sglang_shared formats, n_active=0/1/2,
  eager vs cudagraph, TP=1 vs TP=2, baseline with auto and triton moe_backend.
- 0521_moe_lora_results.md: partial results. All TP=1 eager runs are done;
  cudagraph n_active=2 succeeds; baseline and n_active=0/1 cudagraph runs OOM
  during graph capture; TP=2 triton baseline cudagraph reaches 1012 tok/s.
- test_qwen3_moe_per_expert_lora_password_adapters.py: correctness test for
  per_expert format (128 experts × 48 layers), max_loras=2 due to buffer size.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
All configs collected after sequential retry with corrected gpu_util
(0.75–0.82 for TP=1/2 LoRA to fit model + buffers in 80 GB H100).

Key results: TP=2 cudagraph per_expert n=1: 1081 tok/s (+1.25x vs TP=1);
CUDA graph vs eager LoRA: +11.8x; per_expert ≈ sglang_shared throughout.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
Signed-off-by: Qingyang Wu <qingyang@together.ai>
…format

When w13_A.shape[0] == 1 (sglang_shared gate/up shared outer A), replace:
  selected_A = w13_A[safe_ids]       # materialises (m, k, r, h) gather — 1 GB at rank=256
  lora_a = einsum("mh,mkrh->mkr", ...)
with:
  lora_a = (hidden @ A.T).unsqueeze(1).expand(-1, k, -1).contiguous()
which reads A once and broadcasts over top_k with no data copy.

Same for shared B (down_proj in sglang_shared): (m*k, r) @ B.T instead of
gather + einsum. Eliminates roughly 2 of the 4 large gather tensors per
forward pass, reducing data movement from ~45 GB → ~1 GB per prefill step
at rank=256.

Measured improvement: sglang_shared TTFT at rank=256 drops 553 ms → 266 ms
(−52%) on Qwen3-30B-A3B TP=2 eager. Correctness verified (8 tests, 30 subtests).
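
The equivalence behind the shared-A fast path can be shown with a small pure-Python model. This is a sketch only: the helpers are hypothetical, nested lists stand in for tensors, and A_shared has shape (r, h).

```python
def matvec(matrix, vec):
    """Multiply an (r, h) matrix by a length-h vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def gather_path(hidden, a_shared, top_k):
    """Reference path: copy A for every (token, route) pair, then contract."""
    out = []
    for h in hidden:
        per_route = []
        for _ in range(top_k):
            a_copy = [row[:] for row in a_shared]  # the materialised gather
            per_route.append(matvec(a_copy, h))
        out.append(per_route)
    return out  # shape (m, k, r)


def shared_path(hidden, a_shared, top_k):
    """Fast path: one product per token, reused for every route."""
    out = []
    for h in hidden:
        lora_a = matvec(a_shared, h)  # A is read once per token
        out.append([lora_a] * top_k)  # broadcast over top_k, no copy of A
    return out  # shape (m, k, r)


hidden = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]  # m=2, h=3
a_shared = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # r=2, h=3
reference = gather_path(hidden, a_shared, top_k=4)
fast = shared_path(hidden, a_shared, top_k=4)
print(reference == fast)
```

Both paths return the same (m, k, r) result; the fast path simply never builds the m × k copies of A.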

Signed-off-by: Qingyang Wu <qingyang@together.ai>
1. Shared A/B fast path: when w13_A or w13_B shape[0]==1 (sglang_shared
   format), skip the O(m*k*r*h) gather tensor and use a single matmul.
   For gate_up: hidden@A.T broadcast over top_k; for down: lora_a@B.T.

2. Gate_up B all-experts GEMM: when A is shared, compute delta for ALL E
   experts in one GEMM (m,r)@(r,E*n)->(m,E,n), then gather by safe_ids.
   Replaces 960 tiny per-pair GEMMs with one efficient batched matmul.

3. Remove torch.any(valid) sync: the early-return check caused 96 GPU->CPU
   synchronisations per prefill, stalling the pipeline on every layer call.
   Replaced with a lazy validity mask (built only when token_mask is set).

4. Vectorised scatter operations: _add_route_delta and _route_rows_from_cache
   replaced boolean-index gather with clamp+masked_fill+add_ / scatter_ with
   dummy row, giving 2-3x speedup (128us->57us, 135us->43us per call).

Result (TP=2 cudagraph, Qwen3-30B-A3B, bs=8, rank=16):
  sglang_shared n=1: TTFT 113ms -> 73ms (-35%), tput 975 -> 1093 tok/s
  per_expert    n=1: TTFT 116ms -> 76ms (-34%), tput 1081 -> 1110 tok/s
  n=0 (no active lora): TTFT within 2-4% of baseline (target: <=10%)
  Correctness: 8 tests, 30 subtests passed.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
Signed-off-by: Qingyang Wu <qingyang@together.ai>
The bmm reshape introduced a shape error: selected_A.reshape(mk, i, -1).T
gave (mk, r, i) which is incompatible with (mk, 1, i) in bmm — should be
(mk, i, r). Reverted to torch.einsum which is correct. The sync removal
and scatter optimizations remain in place.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
Two changes to make MoE LoRA work with CUDA graph decode:

1. cuda_graph_wrapper: force has_active_lora=True and single_lora_slot=0
   when capturing the with-LoRA graph (previously captured with
   has_active_lora=False → LoRA kernels omitted from graph entirely).

2. moe_lora: add dynamic GPU tensor indexing path for CUDA-graph-compatible
   weight access. MoeLoraContext now holds references to the per-layer buffer
   tensors (w13_A_buffers etc.). For single-slot batches, weight access uses
   weight_indices[:1].clamp(0) — a GPU tensor read captured in the graph that
   resolves to the correct adapter slot at replay time. Multi-slot batches
   fall back to Python dict lookup (eager-only path, unchanged).

Result: sglang_shared n=1 CUDA graph correctly generates 'PHOENIX-441...'
(password adapter output) instead of base model output. Throughput: 734 tok/s
vs 75 tok/s eager — ~10× speedup with correct LoRA application.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
…calings

Three micro-optimizations in _apply_gate_up_slot and _apply_down_slot:

1. Defer lora_a = expand+contiguous: only compute the (m,k,r) expanded view
   when the per-expert B path actually needs it. The all-experts and shared-B
   paths use lora_a_m (m,r) directly, saving one contiguous-copy kernel per
   layer.

2. Use repeat_interleave instead of expand+contiguous for shared-B: produces
   (m*k,r) directly without an intermediate non-contiguous expand.

3. Reuse slot_idx (already computed for w13_A/B gathers) for the scalings
   lookup, avoiding one extra clamp+gather kernel per layer.

These changes reduce the CUDA kernel count per layer but the measured step
time improvement is within noise (~0.03ms) — the bottleneck is the ~1100
aggregate LoRA kernels per decode step whose individual execution times sum
to ~5ms regardless of small per-kernel reductions.

Signed-off-by: Qingyang Wu <qingyang@together.ai>
Adds two Triton kernels in tokenspeed-kernel for the sglang_shared MoE
LoRA format that activate when the sorted (TMA) MoE dispatch is in use:

1. fused_gate_up_expand: replaces the all-experts B GEMM + candidates.gather
   + _add_route_delta chain (3 kernels) with a single per-sorted-position
   GEMV that writes directly to the sorted output. Reads only active experts'
   B weights (~40% of the all-experts GEMM bandwidth). Measured 2.3× faster
   in microbenchmark (71μs vs 166μs per layer call, rank=16).

2. fused_down_shrink: replaces _route_rows_from_cache + _select_expert_weights
   + einsum with a per-sorted-position GEMV on the sorted intermediate, then
   uses scatter_add_ to accumulate into token-ordered down output. Eliminates
   the 43μs route scatter. Measured 2.4× faster in microbenchmark (61μs vs
   146μs per layer call, rank=16).

Both kernels tile the rank dimension in BLOCK_R chunks to bound register
pressure for large ranks (r=16 to r=256). Both are CUDA-graph compatible:
scaling is loaded from a device tensor so graphs replay with updated adapters.

Integration in moe_lora.py:
- Activated when sorted_token_ids is not None (TMA-enabled MoE config)
- Single-slot CUDA graph path only (multi-slot falls back to existing code)
- Graceful import fallback (_FUSED_MOE_LORA_AVAILABLE) if kernel build fails

All existing accuracy tests pass (test_qwen3_moe_per_expert_lora, 18 subtests).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
moe_dispatch pre-allocates sorted_token_ids for all num_experts (128)
potential slots regardless of actual active experts.  This produces a
tensor larger than intermediate_cache (allocated with a tighter upper
bound based on actual token count).  Without truncation,
_route_rows_from_cache and _add_route_delta failed with shape mismatches,
and _apply_down_sorted had an OOB on topk_weights when sorted_token_ids
contained the sentinel value route_count (not just -1).

Fixes:
- _route_rows_from_cache: truncate sti to cache.shape[0] before scatter_
- _add_route_delta: truncate sti to output.shape[0] before gather
- _apply_down_sorted: truncate to inter_flat.shape[0], clamp flat_j_safe
  to [0, route_count-1] (sentinel route_count value → OOB without it)

Also adds the down MoE kernel config file for Qwen3-30B-A3B on H100 with
USE_TMA=false for all batch sizes.  TMA creates large padded intermediates
(up to 16× for BS=8 decode), which degrades decode performance; the fused
LoRA kernels activate only with sorted dispatch (USE_TMA=true), which is
appropriate for large-M prefill workloads rather than small-batch decode.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Adds _flat_gate_up_expand_kernel for the decode (flat-pair, non-TMA) path.
Unlike the all-experts GEMM which reads all 128 experts' B weights and then
gathers, this kernel computes tok = pid_s // K directly inside the kernel,
issues one GEMV per (token, topk) pair against only the assigned expert's B
rows.

Active-expert B bandwidth: ~51 experts × 1536 × 32 × 2 = 5 MB vs the
all-experts GEMM reading 12.6 MB (60% less).  Microbenchmark: 20μs vs 68μs
per call (3.4×), saving 2.3ms in the isolated gate/up path (×48 layers).
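
The bandwidth figures above can be reproduced from the quoted factors. In the sketch below, the per-expert size uses 1536 × 32 × 2 exactly as written in the commit message; the estimate of about 51 active experts is recovered under assumptions not stated here (8 tokens, top-8 routing, uniformly random expert choice).

```python
NUM_EXPERTS = 128
BYTES_PER_EXPERT = 1536 * 32 * 2  # factors as quoted in the commit message


def expected_active_experts(num_pairs: int, num_experts: int) -> float:
    """Expected number of distinct experts hit by uniformly random routes."""
    return num_experts * (1.0 - (1.0 - 1.0 / num_experts) ** num_pairs)


# Assumed decode shape: 8 tokens, each routed to 8 experts.
active_estimate = expected_active_experts(8 * 8, NUM_EXPERTS)

active_bytes = 51 * BYTES_PER_EXPERT
all_bytes = NUM_EXPERTS * BYTES_PER_EXPERT
saving = 1.0 - active_bytes / all_bytes

print(f"expected active experts: {active_estimate:.1f}")
print(f"active-expert read: {active_bytes / 1e6:.1f} MB")
print(f"all-experts read:   {all_bytes / 1e6:.1f} MB")
print(f"reduction:          {saving:.0%}")
```

Real routing is not uniform, so the active-expert count and the resulting saving will vary with the workload.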

No argsort or sorted_token_ids overhead — computes the (tok, topk_v) lookup
from pid_s inline.  Works for any batch size and is CUDA-graph compatible.

Also retains the SCATTER-mode variant of _fused_gate_up_expand_kernel for
future TMA sorted-dispatch use.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… gathers

Profiling showed 4 separate buffer gathers per layer × 48 layers = 3.08ms/step:
  w13_A gather   22µs   (copies 131 KB, needed for cuBLAS A GEMM)
  w13_B gather   38µs   (copies 12.6 MB, now eliminated by buffer+slot)
  down_A gather  22µs   (copies 1.6 MB, now eliminated by buffer+slot)
  down_B gather  22µs   (copies 131 KB, now eliminated by flat_down_expand)

With those three gathers eliminated, only the w13_A gather / cuBLAS GEMM
chain remained.  Two more changes eliminate it:

1. flat_a_gemm: new Triton kernel that computes lora_a_m = hidden @
   w13_A_buffer[slot, 0, :, :].T directly from the buffer without a
   prior 131 KB gather copy.  Grid (m, R//BLOCK_R).  ~5-8µs vs 47µs
   (22µs gather + 25µs cuBLAS).

2. scalings-buffer: _flat_gate_up_expand_kernel and _flat_down_expand_kernel
   now receive the full (n_slots,) scalings tensor and load scalings[slot]
   internally, eliminating the 19µs scalings[slot_idx] gather per layer × 2.

Combined e2e improvement (BS=8, sglang_shared rank=16, H100 TP=2):
  Before: 927 tok/s, 8.63 ms/step, LoRA overhead 2.90 ms
  After:  1107 tok/s, 7.22 ms/step, LoRA overhead 1.50 ms  (−48%)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…merge

Main's scheduler changes (PagedCacheGroupFamily rename, KVPrefixCache API
changes) are not compatible with the LoRA branch's scheduler extensions.
Restore all scheduler csrc and bindings to the branch versions to keep
the build working. These scheduler changes can be reconciled separately.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…rsions

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… and attn_output_gate

Three fixes for compatibility with Qwen3.5-35B-A3B LoRA adapters:

1. moe_lora.py (_load_2d_adapter_layer): auto-detect transposed A/B convention.
   Qwen3.5 per-expert adapters store A as (in_features, rank) and B as
   (rank, out_features), opposite the standard PEFT (rank, in_features) /
   (out_features, rank) layout.  Detect by shape: if A.shape[0] > A.shape[1]
   or B.shape[0] < B.shape[1], transpose before loading.

2. lora_manager.py: use model_config.head_dim when available.
   Qwen3.5-35B-A3B has hidden_size=2048, num_attention_heads=16, but
   head_dim=256 (decoupled).  The old head_dim = hidden // n_heads = 128
   was wrong, causing kv_size_per_tp=128 instead of 256.

3. lora_manager.py: account for attn_output_gate in q_size_per_tp.
   When attn_output_gate=True, the QKV proj uses 2×num_heads Q heads
   (q_proj output = 8192 for this model), but o_proj still uses 1×
   (o_proj input = 4096).  q_size_per_tp is doubled; o_in_per_tp is not.
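
Fixes 2 and 3 amount to the size arithmetic sketched below. ``attn_lora_sizes`` is a hypothetical helper, not the lora_manager code, and num_kv_heads=2 is an assumption chosen so that kv_size_per_tp matches the 256 quoted above at TP=2.

```python
def attn_lora_sizes(hidden_size, num_heads, num_kv_heads, tp_size,
                    head_dim=None, attn_output_gate=False):
    """Per-TP-rank projection sizes used to shape attention LoRA buffers."""
    if head_dim is None:
        # Legacy fallback: only valid when head_dim is tied to hidden_size.
        head_dim = hidden_size // num_heads
    heads_per_tp = num_heads // tp_size
    kv_heads_per_tp = max(1, num_kv_heads // tp_size)

    o_in_per_tp = heads_per_tp * head_dim
    # With an output gate, q_proj emits twice as many rows as o_proj consumes.
    q_size_per_tp = o_in_per_tp * (2 if attn_output_gate else 1)
    kv_size_per_tp = kv_heads_per_tp * head_dim
    return q_size_per_tp, kv_size_per_tp, o_in_per_tp


# Model-wide sizes (tp_size=1) match the figures quoted in the commit.
full = attn_lora_sizes(2048, 16, 2, tp_size=1, head_dim=256, attn_output_gate=True)
# Per-rank sizes at TP=2, with and without the explicit head_dim.
fixed = attn_lora_sizes(2048, 16, 2, tp_size=2, head_dim=256, attn_output_gate=True)
legacy = attn_lora_sizes(2048, 16, 2, tp_size=2)
print(full, fixed, legacy)
```

The legacy call shows how deriving head_dim from hidden_size halves kv_size_per_tp for a model whose head_dim is decoupled.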

Results on Qwen3.5-35B-A3B with Qwen3.5-35B-A3B-LoRA-Password-Adapters (TP=2):
  sglang_shared_outer: 8/8 PASS
  per_expert:          8/8 PASS (required the transpose fix)
  self_attn:           0/8 FAIL (adapter is inherently weak, loss ~1e-4 not 0)
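
The shape test from fix 1 can be written as a one-line predicate. ``needs_transpose`` is a hypothetical name and the example shapes are illustrative; the rule itself is the one stated in the commit message, and it implicitly assumes the rank is smaller than both feature dimensions.

```python
def needs_transpose(a_shape, b_shape) -> bool:
    """True when A/B look like (in, rank) / (rank, out) instead of PEFT layout.

    Standard PEFT stores A as (rank, in_features) and B as (out_features, rank).
    """
    return a_shape[0] > a_shape[1] or b_shape[0] < b_shape[1]


RANK, IN_FEATURES, OUT_FEATURES = 16, 2048, 768  # illustrative sizes

peft_layout = needs_transpose((RANK, IN_FEATURES), (OUT_FEATURES, RANK))
transposed_layout = needs_transpose((IN_FEATURES, RANK), (RANK, OUT_FEATURES))
print(peft_layout, transposed_layout)
```

A square A or B (rank equal to a feature dimension) would be ambiguous under this heuristic and is left untransposed.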

Signed-off-by: Qingyang Wu <willqywu@gmail.com>